-
Notifications
You must be signed in to change notification settings - Fork 745
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[rag evals][1/n] refactor base scoring fn & data schema check #664
Conversation
llama_stack/providers/utils/common/data_schema_validator_mixin.py
Outdated
Show resolved
Hide resolved
@@ -13,12 +13,51 @@ | |||
|
|||
class BaseScoringFn(ABC): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't understand why we have these base classes because we have already declared what our impls need to do in terms of datatypes in our APIs. so the datatype for a scoring function already exists. Let's say I am implementing a new scoring function -- why do I need another base class and inherit from there? If there is some utilities I need for implementing the functions, they would be just utils / free functions or in the worst case, some mixins.
Can you explain the need for base classes please? I am very allergic to inheritance as you and @raghotham knows :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class BaseScoringFn(ABC):
is mostly separated out based on feedback from Tejas for his use case without
registration (syncing with you offline).
For our llama-stack implementations, I agree we don't need this separate BaseScoringFn
, and could just use RegisteredBaseScoringFn
as mixins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yanxi0830 I see, interesting. Thanks for the explanation. We can keep this as is for now.
@@ -0,0 +1,93 @@ | |||
# Copyright (c) Meta Platforms, Inc. and affiliates. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can datasetio abstract schema validation instead of creating mixins to be used by scoring and evals?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@raghotham can you explain what you mean by this with an example? how do you imagine this validation happening specifically?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lg
# What does this PR do? - add more braintrust scoring functions for RAG eval - add tests for evaluating against context ## Test Plan ``` pytest -v -s -m braintrust_scoring_together_inference scoring/test_scoring.py ``` <img width="850" alt="image" src="https://github.com/user-attachments/assets/2f8f0693-ea13-422c-a183-f798faf86433" /> **Example Output** - https://gist.github.com/yanxi0830/2acf3b8b3e8132fda2a48b1f0a49711b <img width="827" alt="image" src="https://github.com/user-attachments/assets/9014b957-107c-4c23-bbc0-812cbd0b16da" /> <img width="436" alt="image" src="https://github.com/user-attachments/assets/21e9da17-f426-49b2-9113-855cab7b3d40" /> ## Sources Please link relevant resources if necessary. ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Ran pre-commit to handle lint / formatting issues. - [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.
What does this PR do?
See [rag evals][2/n] add more braintrust scoring fns for RAG eval #666 & [rag evals][3/n] add ability to eval retrieval + generation in agentic eval pipeline #668
Refactor BaseScoringFn to be just a minimal interface, add new RegistrableBaseScoring
Refactor data schema check
Refactor braintrust eval (more scoring fn added & tested in following PR)
Test Plan
Before submitting
Pull Request section?